System And Method For Generating Training Data For Function Approximation Of An Unknown Process Such As A Search Engine Ranking Algorithm

ABSTRACT

A system and method for generating training data for a machine learning system. A training data generator server sends at least one keyword to a search engine. The training data generator server receives at least a first and a second page from the search engine in response to the keyword, the first page having a first rank, the second page having a second rank, the first and second rank being based on the keyword. The training data generator server assigns a first label to the first page based on the first rank; and assigns a second label to the second page based on the second rank. The first web page, second page, first label and second label are forwarded to a machine learning server.

This application claims priority to U.S. Patent application Ser. No. 61/093,586 entitled “Techniques for Automated Search Rank Function Approximation, Rank Improvement Recommendations and Predictions”, filed Sep. 2, 2008, the entirety of which is hereby incorporated by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This disclosure relates to machine learning algorithms and, more particularly, to generation of training data for machine learning algorithms.

2. Description of the Related Art

Referring to FIG. 1, the World Wide Web (“WWW”) is a distributed database including literally billions of pages accessible through the Internet. Searching and indexing these pages to produce useful results in response to user queries is constantly a challenge. A search engine is typically used to search the WWW.

A typical prior art search engine 20 is shown in FIG. 1. Pages from the Internet or other source 22 are accessed through the use of a crawler 24. Crawler 24 aggregates pages from source 22 to ensure that these pages are searchable. Many algorithms exist for crawlers and in most cases these crawlers follow links in known hypertext documents to obtain other documents. The pages retrieved by crawler 24 are stored in a database 36. Thereafter, these pages are indexed by an indexer 26. Indexer 26 builds a searchable index of the pages in a database 34. For example, each web page may be broken down into words and respective locations of each word on the page. The pages are then indexed by the words and their respective locations.
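
The word-and-location indexing described above can be illustrated with a minimal sketch. It is not taken from any particular search engine; the function name `build_index`, the tokenizing rule and the sample pages are assumptions made only for illustration.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each word to the pages containing it and the word positions on each page."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        # Assumed tokenizer: lowercase runs of letters and digits.
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index[word].setdefault(page_id, []).append(position)
    return index

# Hypothetical sample pages, keyed by a made-up page identifier.
pages = {
    "page-a": "Cordless telephone with answering machine",
    "page-b": "Telephone repair and telephone accessories",
}
index = build_index(pages)
```

Looking up `index["telephone"]` then yields, for each page, the positions at which the word occurs.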

In use, a user 32 sends a search query to a dispatcher 30. Dispatcher 30 compiles a list of search nodes in cluster 28 to execute the query and forwards the query to those selected search nodes. The search nodes in search node cluster 28 search respective parts of the index 34 and return search results along with a document identifier to dispatcher 30. Dispatcher 30 merges the received results to produce a final result set displayed to user 32 sorted by ranking scores based on a ranking function.
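
The merging step performed by the dispatcher can be sketched as follows. This is a simplified illustration, assuming each search node returns (score, document identifier) pairs already sorted from highest to lowest score; the function name `merge_results` and the sample scores are hypothetical.

```python
import heapq

def merge_results(node_results, limit=10):
    """Merge per-node result lists, each sorted by descending score, into one ranked list."""
    # heapq.merge expects ascending order, so sort on the negated score.
    merged = heapq.merge(*node_results, key=lambda hit: -hit[0])
    return [doc_id for _score, doc_id in merged][:limit]

# Hypothetical results from two search nodes.
node_results = [
    [(0.9, "doc-3"), (0.4, "doc-7")],
    [(0.8, "doc-1"), (0.1, "doc-5")],
]
final_result_set = merge_results(node_results)
```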

The ranking is further a function of the query itself and the type of page produced. Factors that are used for relevance include hundreds of features extracted, collected or identified for each page including: a static relevance score for the page such as link cardinality and page quality, superior parts of the page such as titles, metadata and page headers, authority of the page such as external references and the “level” of the references, the GOOGLE page rank algorithm, and page statistics such as query term frequency in the page, words on a page, global term frequency, term distances within the page, etc.

The use of search engines has become one of the most popular online activities with billions of searches being performed by users every month. Search engines are also a starting point for consumers for shopping and various day to day purchases and activities. With billions of dollars being spent by consumers online, it has become ever more important for web sites to organize and optimize their web pages in an effort to be more visible and accessible to users of a search engine.

As discussed above, for each web page, hundreds of features are extracted and a ranking function is applied to those features to produce a ranking score. A merchant with a web page would like his page to be ranked higher in a result set based on relevant search keywords compared with web pages of his competitor for the same keywords. For example, for a merchant selling telephones, that merchant would like his web page to acquire a higher ranking score, and appear higher in a result set produced by a search engine based on the keyword query “telephone” than the ranking scores of web sites of his competitors for the same keyword. There are some prior art solutions available to guess the ranking algorithm used by a search engine and to provide recommendations about improvements that can be made to web pages so that the ranking score for a web page relating to particular keywords may improve. However, most of these systems use manual human judgment and historical knowledge about search engines. Humans must be trained to perform this analysis. These judgments are mostly based on guesses or arrived at by trial and error. Consequently, most prior art solutions are inaccurate, time consuming, and require expensive human capital. Moreover, these solutions are available only for specific search engines and are not immune to changes in search or ranking algorithms used by known search engines nor do they have the ability to adapt to new search engines.

SUMMARY OF THE INVENTION

One embodiment of the invention is a method for generating training data for a machine learning system. The method comprises sending at least one keyword to a search engine; and receiving at a first processor at least a first and a second page from the search engine in response to the keyword, the first page having a first rank, the second page having a second rank, the first and second rank being based on the keyword. The method further comprises assigning at the first processor a first label to the first page based on the first rank; assigning at the first processor a second label to the second page based on the second rank; and forwarding the first web page, second page, first label and second label to a machine learning processor.

Another embodiment of the invention is a method for generating training data for a machine learning system. The method comprises sending at least one input to a system effective to perform a process; and receiving at a first processor at least a first and a second output from the system in response to the input, the first output having a first rank, the second output having a second rank, the first and second rank being based on the input. The method further comprises assigning at the first processor a first label to the first output based on the first rank; assigning at the first processor a second label to the second output based on the second rank; and forwarding the first result, second result, first label and second label to a machine learning processor.

Yet another embodiment of the invention is a system for generating training data for a machine learning system. The system comprises a first processor effective to send at least one keyword to a search engine. The first processor is further effective to: receive at least a first and a second page from the search engine in response to the keyword, the first page having a first rank, the second page having a second rank, the first and second rank being based on the keyword; assign a first label to the first page based on the first rank; and assign a second label to the second page based on the second rank. The system further comprises a machine learning processor connected to the first processor, the machine learning processor effective to receive the first web page, second web page, first label and second label.

Still another embodiment of the invention is a computer readable storage medium including computer executable code effective to generate training data for a machine learning system. The code includes the steps of sending at least one keyword to a search engine; and receiving at a first processor at least a first and a second page from the search engine in response to the keyword, the first page having a first rank, the second page having a second rank, the first and second rank being based on the keyword. The code further includes the steps of assigning at the first processor a first label to the first page based on the first rank; assigning at the first processor a second label to the second page based on the second rank; and forwarding the first web page, second page, first label and second label to a machine learning processor.

BRIEF DESCRIPTION OF THE DRAWINGS

The drawings constitute a part of the specification and include exemplary embodiments of the present invention and illustrate various objects and features thereof.

FIG. 1 is a system drawing of a search engine in accordance with the prior art.

FIG. 2 is a system drawing of a machine learning system in accordance with an embodiment of the invention.

FIG. 3 is a schematic drawing of a database structure in accordance with an embodiment of the invention.

FIG. 4 is a flow chart of a process which could be used in accordance with an embodiment of the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT(S)

Various embodiments of the invention are described hereinafter with reference to the figures. Elements of like structures or function are represented with like reference numerals throughout the figures. The figures are intended only to facilitate the description of the invention and not as a limitation on the scope of the invention. In addition, an aspect described in conjunction with a particular embodiment of the invention is not necessarily limited to that embodiment and can be practiced in conjunction with any other embodiments of the invention.

When applying a ranking function, search engines receive as input: 1) at least one keyword and 2) a plurality of web pages in a result set produced based on keyword(s). With those inputs, the search engine produces as an output a ranking score for each web page. The inventors recognized this phenomenon and produced a system and algorithm to reverse engineer the function performed by search engines to produce that output. Stated another way, search engines perform the following ranking function to generate a ranking score for each page in a result set:

ranking score=F(input)

where the input is the search query in the form of keyword(s) and the extracted features of the pages in the result set. The present system and method determine training data that may be used to determine the function F used by a search engine.

In order to approximate the ranking function, training data may be sent to a machine learning system. Generating such training data is perhaps the most difficult and labor intensive part of any machine learning system. As discussed above, prior art techniques for generating training data include the use of teams of humans subjectively viewing selected portions of available data such as keywords and result sets. Even if collection of data may be automated, in the prior art, labeling of the data is performed manually. Such labeling techniques are often inaccurate as they are subject to human judgment of a complex system such as a search engine. A human being typically cannot judge by intuition whether he has collected all kinds of different search results to ensure that the training data is diverse and it is generally not possible to manually track or generate a diverse set of data. A diverse training set is desired for a machine learning algorithm to work well. Moreover, human labeling is not accurate because it is generally not possible to judge a label value by intuition.

Referring to FIG. 2, there is shown a system 80 in accordance with an embodiment of the invention. System 80 includes a training data generator server 60. Training data generator server or processor 60 sends keywords 62 over a network 64 (such as the Internet) to a search engine server 66. Keywords 62 could be virtually any set of keywords that, when input to a search engine, yield web pages in a result set. It is desirable to generate a number of different sets of keywords. Many techniques could be used to generate such sets. For example, keyword tools provided by search engines such as the MSN Keyword tool or the GOOGLE ADWORDS tool could be used; third party tools which monitor and collect keywords based on popularity, usage and other metrics may be used; or statistical analysis may be used to determine important keywords from web pages and web logs. For example, by collecting the frequency distribution of keywords from web pages and web logs, it may be possible to identify important keywords from pages. Keywords 62 are sent by search engine server 66 to a search engine index 68.
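
The statistical approach to keyword generation mentioned above can be sketched as a simple frequency count. This is only one possible reading of "frequency distribution"; the stop-word list, the function name `candidate_keywords` and the sample documents are assumptions introduced for illustration.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "to", "in", "on", "with"}

def candidate_keywords(documents, top_n=3):
    """Return the most frequent non-stop-words across page texts or web log entries."""
    counts = Counter()
    for text in documents:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(word for word in words if word not in STOPWORDS)
    return [word for word, _count in counts.most_common(top_n)]

# Hypothetical page texts.
documents = [
    "Buy a cordless telephone for the office",
    "Telephone accessories and telephone repair",
    "Office telephone systems",
]
keywords = candidate_keywords(documents, top_n=2)
```

The resulting list could serve as one set of keywords 62 to send to the search engine server.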

Search engine index 68 outputs web pages 70 that are responsive to a search query including keywords 62. Search engine server 66 receives web pages 70 and orders or ranks web pages 70 based on an unknown ranking algorithm to produce ranked web pages 76.

Ranked web pages 76 are sent over network 64 and fed to training data generator server 60. Training data generator server 60 stores ranked web pages 76 and labels 82 for those pages in a training data storage 84. A label 82 is associated with each ranked web page 76 corresponding to the rank of the ranked web page 76 based on keyword 62. Label 82 allows system 80 to represent the relevance of each ranked web page 76 to keywords 62. Prior art labeling techniques required manually intensive, inaccurate and expensive human capital. Humans would view each ranked web page 76 and provide an appropriate label. The inventors have determined that a linear distribution of the ranking scores is a good representation of those scores. Consequently, if L ranked web pages 76 are considered, the highest ranked web page is given a label L, the second highest is given a label L-1, etc.
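
The linear labeling rule described above can be expressed in a few lines. The sketch below assumes the ranked pages arrive as a list ordered from highest to lowest rank; the function name `assign_linear_labels` and the sample page names are hypothetical.

```python
def assign_linear_labels(ranked_pages):
    """Given pages ordered best-first, label the top page L, the next L-1, and so on."""
    total = len(ranked_pages)  # L in the description
    return [(total - position, page) for position, page in enumerate(ranked_pages)]

# Hypothetical result set for one keyword, highest ranked page first.
labeled = assign_linear_labels(["page-a", "page-b", "page-c"])
```

With three pages, the highest ranked page receives label 3 and the lowest receives label 1.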

Referring to FIGS. 2 and 3, there is shown an example of a training data structure 110 which may be stored in training data storage 84. As shown, for a keyword 112 (“telephone” is shown) training data structure 110 may include a label column 114 and a web page column 118. Label column 114 includes labels 116 for ranked web pages 76 (FIG. 2). The web pages themselves may be stored in web page column 118. The contents of training data structure 110 may be forwarded and used as training data in a machine learning server or processor 74. Machine learning server 74 may use any known machine learning techniques on training data 110 to produce an approximated ranking function 88.
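
One way to picture training data structure 110 in code is a keyword together with parallel label and web page columns. The class and method names below are assumptions for illustration and do not appear in the figures.

```python
from dataclasses import dataclass, field

@dataclass
class TrainingRow:
    label: int       # an entry of label column 114
    web_page: str    # an entry of web page column 118 (page content or a reference to it)

@dataclass
class TrainingData:
    keyword: str                              # keyword 112
    rows: list = field(default_factory=list)  # one row per ranked web page

    def add(self, label, web_page):
        self.rows.append(TrainingRow(label, web_page))

    def label_column(self):
        return [row.label for row in self.rows]

    def web_page_column(self):
        return [row.web_page for row in self.rows]

# Hypothetical contents for the keyword "telephone".
data = TrainingData("telephone")
data.add(2, "<html>first ranked page</html>")
data.add(1, "<html>second ranked page</html>")
```

An object of this kind could be serialized and forwarded to the machine learning server as training data.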

Referring to FIG. 4, there is shown a flow chart of a process which could be used in accordance with an embodiment of the invention. The process could be used with, for example, system 80 described with respect to FIG. 2. As shown at step S2, at least one input or keyword is sent to a search engine or any other system implementing a process. At step S4, the search engine queries a search engine index using the keyword to produce a result set including web pages or the process uses the keywords as input to produce an output. At step S6, the search engine ranks the web pages or the process ranks the output. At step S8, the search engine or process forwards the inputs or keywords and ranked web pages or outputs to a training data server or processor. At step S10, the training data server assigns a label to each page or output based on the rank. At step S12, the labels and pages or outputs are used as training data.
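
The steps of FIG. 4 can be strung together in a short sketch. The search engine is represented here by a stand-in function, since the real engine is an external system; `generate_training_data`, `fake_search` and the sample data are assumptions made for illustration, and the linear labeling shown is the scheme described with respect to FIG. 2.

```python
def generate_training_data(keywords, search, forward):
    """Run steps S2 through S12 once per keyword."""
    for keyword in keywords:
        # S2-S8: send the keyword and receive the ranked result set, best page first.
        ranked_pages = search(keyword)
        # S10: assign a label to each page based on its rank (L, L-1, ...).
        total = len(ranked_pages)
        labeled = [(total - i, page) for i, page in enumerate(ranked_pages)]
        # S12: hand the labels and pages over for use as training data.
        forward(keyword, labeled)

def fake_search(keyword):
    """Hypothetical stand-in for a search engine; returns pages ordered best-first."""
    return [f"{keyword}-page-{n}" for n in (1, 2, 3)]

received = []
generate_training_data(
    ["telephone"],
    fake_search,
    lambda keyword, rows: received.append((keyword, rows)),
)
```

Passing the search and forwarding steps in as functions keeps the labeling logic independent of any particular search engine or machine learning server.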

Clearly, although different servers are shown for various elements, those servers could be combined in a single processor housing or location.

A system in accordance with that described above can be used to collect training data for any search engine. Moreover, the system can adapt automatically to changes in ranking functions of existing search engines and produce new training data accordingly. Prior art systems are significantly limited in that subjective, expensive human capital is used to analyze only samples of available data. A system in accordance with the invention could analyze one page or thousands of pages easily and efficiently.

The invention has been described with reference to an embodiment that illustrates the principles of the invention and is not meant to limit the scope of the invention. Modifications and alterations may occur to others upon reading and understanding the preceding detailed description. It is intended that the scope of the invention be construed as including all modifications and alterations that may occur to others upon reading and understanding the preceding detailed description insofar as they come within the scope of the following claims or equivalents thereof. Various changes may be made without departing from the spirit and scope of the invention.

Although the above description is focused on the search engine context, the inventive concepts may be applied to any function approximation system where the inputs and outputs are known.

As can be discerned, the system and process described above are more accurate than human labeling because, in part, results of the unknown process, such as search engine ranking, are used. As the system is automated, it is possible to easily collect large amounts of training data without manual intervention. Ranking algorithms produced in accordance with the invention are change resistant. This is because training data is based on search results. If any search engine changes its ranking algorithm, the results will change and the training data will change. Prior art systems based on intuition and prior knowledge of humans cannot adapt as easily. The system works with known and to be developed search engines and can easily be applied to specific sites such as TRAVELOCITY.COM.

1. A method for generating training data for a machine learning system, the method comprising: sending at least one keyword to a search engine; receiving at a first processor at least a first and a second page from the search engine in response to the keyword, the first page having a first rank, the second page having a second rank, the first and second rank being based on the keyword; assigning at the first processor a first label to the first page based on the first rank; assigning at the first processor a second label to the second page based on the second rank; and forwarding the first web page, second page, first label and second label to a machine learning processor.
2. The method as recited in claim 1, wherein the first and second labels are based on a linear distribution of a ranking of the first and second pages by the search engine.
3. The method as recited in claim 1, wherein the pages are web pages.
4. The method as recited in claim 1, wherein the keyword is generated using at least one of an MSN keyword tool, GOOGLE ADWORDS, and a statistical analysis of keywords from web pages.
5. A method for generating training data for a machine learning system, the method comprising: sending at least one input to a system effective to perform a process; receiving at a first processor at least a first and a second output from the system in response to the input, the first output having a first rank, the second output having a second rank, the first and second rank being based on the input; assigning at the first processor a first label to the first output based on the first rank; assigning at the first processor a second label to the second output based on the second rank; and forwarding the first result, second result, first label and second label to a machine learning processor.
6. The method as recited in claim 5, wherein the first and second labels are based on a linear distribution of a ranking of the first and second pages by the search engine.
7. The method as recited in claim 5, wherein the pages are web pages.
8. The method as recited in claim 5, wherein the keyword is generated using at least one of an MSN keyword tool, GOOGLE ADWORDS, and a statistical analysis of keywords from web pages.
9. A system for generating training data for a machine learning system, the system comprising: a first processor effective to send at least one keyword to a search engine; the first processor further effective to: receive at least a first and a second page from the search engine in response to the keyword, the first page having a first rank, the second page having a second rank, the first and second rank being based on the keyword; assign a first label to the first page based on the first rank; and assign a second label to the second page based on the second rank; and a machine learning processor connected to the first processor, the machine learning processor effective to receive the first web page, second web page, first label and second label.
10. The system as recited in claim 9, wherein the first and second labels are based on a linear distribution of a ranking of the first and second pages by the search engine.
11. The system as recited in claim 9, wherein the pages are web pages.
12. The system as recited in claim 9, wherein the keyword is generated using at least one of an MSN keyword tool, GOOGLE ADWORDS, and a statistical analysis of keywords from web pages.
13. A computer readable storage medium including computer executable code effective to generate training data for a machine learning system, the code including the steps of: sending at least one keyword to a search engine; receiving at a first processor at least a first and a second page from the search engine in response to the keyword, the first page having a first rank, the second page having a second rank, the first and second rank being based on the keyword; assigning at the first processor a first label to the first page based on the first rank; assigning at the first processor a second label to the second page based on the second rank; and forwarding the first web page, second page, first label and second label to a machine learning processor.
14. The storage medium as recited in claim 13, wherein the first and second labels are based on a linear distribution of a ranking of the first and second pages by the search engine.
15. The storage medium as recited in claim 13, wherein the pages are web pages.
16. The storage medium as recited in claim 13, wherein the keyword is generated using at least one of an MSN keyword tool, GOOGLE ADWORDS, and a statistical analysis of keywords from web pages.